
16 research outputs found

In-Datacenter Performance Analysis of a Tensor Processing Unit

Author: Agrawal Gaurav
Bajwa Raminder
Bates Sarah
Bhatia Suresh
Boden Nan
Borchers Al
Boyle Rick
Cantin Pierre-luc
Chao Clifford
Clark Chris
Coriell Jeremy
Daley Mike
Dau Matt
Dean Jeffrey
Gelb Ben
Ghaemmaghami Tara Vazir
Gottipati Rajendra
Gulland William
Hagmann Robert
Ho C. Richard
Hogberg Doug
Hu John
Hundt Robert
Hurt Dan
Ibarz Julian
Jaffey Aaron
Jaworski Alek
Jouppi Norman P.
Kaplan Alexander
Khaitan Harshit
Koch Andy
Kumar Naveen
Lacy Steve
Laudon James
Law James
Le Diemthu
Leary Chris
Liu Zhuyuan
Lucke Kyle
Lundin Alan
MacKean Gordon
Maggiore Adriana
Mahony Maire
Miller Kieran
Nagarajan Rahul
Narayanaswami Ravi
Ni Ray
Nix Kathy
Norrie Thomas
Omernick Mark
Patil Nishant
Patterson David
Penukonda Narayana
Phelps Andy
Ross Jonathan
Ross Matt
Salek Amir
Samadiani Emad
Severn Chris
Sizikov Gregory
Snelham Matthew
Souter Jed
Steinberg Dan
Swing Andy
Tan Mercedes
Thorson Gregory
Tian Bo
Toma Horia
Tuttle Erick
Vasudevan Vijay
Walter Richard
Wang Walter
Wilcox Eric
Yoon Doe Hyun
Young Cliff
Publication date: 16/04/2017

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X-30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X-80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

Comment: 17 pages, 11 figures, 8 tables. To appear at the 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, June 24-28, 201
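
As a rough sanity check on the peak-throughput figure in the abstract above, the sketch below multiplies the MAC count by the clock rate. The 65,536 MAC count and the 92 TOPS figure come from the abstract; the 256 x 256 array shape, the 700 MHz clock, and the convention of counting each multiply-accumulate as two operations are assumptions taken from outside this listing.

```python
# Back-of-the-envelope check of the quoted peak throughput.
# From the abstract: 65,536 8-bit MACs, 92 TOPS peak.
# Assumptions (not stated in this listing): 256 x 256 array shape,
# 700 MHz clock, and 2 operations (multiply + add) per MAC per cycle.
MACS = 256 * 256          # 65,536 multiply-accumulate units
OPS_PER_MAC = 2           # assumed: one multiply and one add
CLOCK_HZ = 700e6          # assumed clock rate

peak_tops = MACS * OPS_PER_MAC * CLOCK_HZ / 1e12
print(f"{MACS} MACs -> {peak_tops:.2f} TOPS peak")
```

Under these assumptions the result rounds to the 92 TOPS quoted in the abstract.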

arXiv.org e-Print Archive

Crossref

Recommended from our members

Flexible and efficient reliability in memory systems

Author: Yoon Doe Hyun
Publication date: 22/06/2011

Future computing platforms will increasingly demand more stringent memory resiliency mechanisms due to shrinking memory cell size, reduced error margins, higher capacity, and higher reliability expectations. Traditional mechanisms, which apply error checking and correcting (ECC) codes uniformly across all memory locations, are inefficient: uniform protection dedicates resources to redundant information and demands higher cost for stronger protection, a fixed (worst-case based) error tolerance level, and a fixed access granularity. The design of modern computing platforms is a multi-objective optimization, balancing performance, reliability, and many other parameters within a constrained power budget. If resiliency mechanisms consume too many resources, we lose an opportunity to improve performance. Hence, it is important and necessary to enable more efficient and flexible memory resiliency mechanisms. This dissertation develops techniques that enable efficient, adaptive, and dynamically tunable memory resiliency mechanisms. First, we develop two-tiered protection, apply it to the last-level cache, and present Memory Mapped ECC (MME) and ECC FIFO. Two-tiered protection provides low-cost error detection or light-weight correction for common-case read operations, while the uncommon-case error correction overhead is off-loaded to the main memory namespace. MME and ECC FIFO use different schemes for managing redundant information in main memory. Both achieve a 15-25% reduction in area and a 9-18% reduction in power consumption of the last-level cache, while performance is degraded by only 0.7% on average. Then, we apply two-tiered protection to main memory and augment the virtual memory interface to dynamically adapt error tolerance levels according to user, system, and environmental needs. This mechanism, Virtualized ECC (V-ECC), improves system energy efficiency by 12% and degrades performance by only 1-2% for chipkill-correct level protection.
V-ECC also supports ECC in a system with no dedicated storage for redundant information. Lastly, we propose the adaptive granularity memory system (AGMS) that allows different access granularities, while supporting ECC. By not wasting off-chip bandwidth for transferring unnecessary data, AGMS achieves higher throughput (by 44%) and power efficiency (by 46%) in a 4-core CMP system. Furthermore, AGMS will provide further gains in future systems, where off-chip bandwidth will be comparatively scarce.

Electrical and Computer Engineerin

Texas ScholarWorks

Flexible Cache Error Protection using an ECC FIFO

Author: Doe Hyun Yoon
Mattan Erez
Publication date: 01/01/2009

We present ECC FIFO, a mechanism enabling two-tiered last-level cache error protection using an arbitrarily strong tier-2 code without increasing on-chip storage. Instead of adding redundant ECC information to each cache line, our ECC FIFO mechanism off-loads the extra information to off-chip DRAM. We augment each cache line with a tier-1 code, which provides error detection consuming limited resources. The redundancy required for strong protection is provided by a tier-2 code placed in off-chip memory. Because errors that require tier-2 correction are rare, the overhead of accessing DRAM is unimportant. We show how this method can save 15-25% and 10-17% of on-chip cache area and power respectively while minimally impacting performance, which decreases by 1% on average across a range of scientific and consumer benchmarks.
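
The two-tier split described in the abstract above can be illustrated with a minimal sketch. This is not the authors' design: the class name `TwoTierCache`, the single parity bit used as the tier-1 code, the XOR-of-bit-positions locator used as a stand-in tier-2 code, and the dictionary standing in for off-chip DRAM (rather than a FIFO) are all hypothetical simplifications. The sketch only handles a single flipped data bit.

```python
def parity(word: int) -> int:
    """Tier-1 code (assumed): one parity bit, detects any single-bit flip."""
    return bin(word).count("1") & 1


def locator(word: int) -> int:
    """Tier-2 stand-in (assumed): XOR of the 1-based positions of set bits.
    A single flipped bit changes this value by exactly its position."""
    acc, pos = 0, 1
    while word:
        if word & 1:
            acc ^= pos
        word >>= 1
        pos += 1
    return acc


class TwoTierCache:
    """Hypothetical model: tier-1 kept with the line, tier-2 kept 'off-chip'."""

    def __init__(self):
        self.lines = {}     # addr -> (data, tier-1 parity), on-chip
        self.dram_ecc = {}  # addr -> tier-2 locator, stands in for DRAM

    def write(self, addr: int, data: int) -> None:
        self.lines[addr] = (data, parity(data))
        self.dram_ecc[addr] = locator(data)

    def read(self, addr: int) -> int:
        data, p = self.lines[addr]
        if parity(data) == p:
            return data  # common case: no off-chip access needed
        # Rare case: fetch tier-2 information and correct the flipped bit.
        flipped_pos = locator(data) ^ self.dram_ecc[addr]
        data ^= 1 << (flipped_pos - 1)
        self.lines[addr] = (data, p)
        return data


cache = TwoTierCache()
cache.write(0x10, 0b10110010)
data, p = cache.lines[0x10]
cache.lines[0x10] = (data ^ (1 << 5), p)  # inject a single-bit error
recovered = cache.read(0x10)
print(bin(recovered))
```

The error-free read path touches only on-chip state, which is the property the abstract relies on when it argues that the DRAM access overhead is unimportant.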

CiteSeerX

Crossref

Dynamic Power Supply Current Testing for Open Defects in CMOS SRAMs

Author: Doe-hyun Yoon
Hong-sik Kim
Sungho Kang

The detection of open defects in CMOS SRAM has been a time-consuming process. This paper proposes a new dynamic power supply current testing method to detect open defects in CMOS SRAM cells. By monitoring a dynamic current pulse during a transition write operation or a read operation, open defects can be detected. In order to measure the dynamic power supply current pulse, a current monitoring circuit with low hardware overhead is developed. Using the sensor, the new testing method does not require any additional test sequence. The results show that the new test method is very efficient compared with other testing methods. Therefore, the new testing method is very attractive.

CiteSeerX

Containment Domains: A Scalable, Efficient and Flexible Resilience Scheme for Exascale Systems

Author: Doe Hyun Yoon
Dong Wan Kim
Ikhwan Lee
Jee Ho Ryoo
Jinsuk Chung
Larry Kaplan
Mattan Erez
Michael Sullivan
Publication venue: IOS Press
Publication date: 01/01/2013

This paper describes and evaluates a scalable and efficient resilience scheme based on the concept of containment domains. Containment domains are a programming construct that enables applications to express resilience needs and to interact with the system to tune and specialize error detection, state preservation and restoration, and recovery schemes. Containment domains have weak transactional semantics and are nested to take advantage of the machine and application hierarchies and to enable hierarchical state preservation, restoration, and recovery. We evaluate the scalability and efficiency of containment domains using generalized trace-driven simulation and analytical analysis and show that containment domains are superior to both checkpoint restart and redundant execution approaches.
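
The preserve, detect, restore, and re-execute cycle described in the abstract above, together with escalation from a nested domain to its parent, can be sketched as follows. This is an illustrative toy, not the authors' API: `run_domain`, `DomainFailure`, the dictionary-based state, and the injected fault are all hypothetical names and simplifications.

```python
import copy


class DomainFailure(Exception):
    """Raised when a domain cannot recover locally and escalates."""


def run_domain(state, body, check, max_retries=1):
    """Hypothetical containment domain: preserve state, run the body,
    detect errors with check(), and restore plus re-execute on failure.
    Returns the number of re-executions that were needed."""
    snapshot = copy.deepcopy(state)          # preservation
    for attempt in range(max_retries + 1):
        try:
            body(state)
            if check(state):                 # detection
                return attempt
        except DomainFailure:
            pass                             # a nested domain escalated
        state.clear()                        # restoration
        state.update(copy.deepcopy(snapshot))
    raise DomainFailure("retries exhausted; escalating to parent domain")


faults = [True]  # one injected fault, consumed on first use


def child_body(s):
    s["x"] += 1
    if faults:
        faults.pop()
        s["x"] = -999  # simulate a corrupted result


def parent_body(s):
    s["steps"] += 1
    # The child has no local retries, so its failure escalates to the parent.
    run_domain(s, child_body, lambda t: t["x"] >= 0, max_retries=0)


state = {"x": 1, "steps": 0}
attempts = run_domain(state, parent_body, lambda t: True, max_retries=1)
print(state, attempts)
```

Here the child domain fails once and cannot retry, so the parent restores its own snapshot and re-executes, which is one way to read the hierarchical recovery the abstract describes.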

Directory of Open Access Journals